1.  INTRODUCTION

As HPC applications run on increasingly high process counts on
larger and larger machines, both the frequency of checkpoints needed
for fault tolerance [14] and the resolution and size of data analysis
dumps are expected to increase proportionally. In order to maintain
an acceptable ratio of time spent performing useful computation to
time spent performing I/O, write bandwidth to the underlying storage
system must increase in proportion to this growth in checkpoint and
computation size. Unfortunately, popular self-describing scientific
file formats such as netCDF [8] and HDF5 [3] are designed with a
focus on portability and flexibility. Careful crafting of the output
structure and API calls is required to optimize for write performance
using these APIs. To provide sufficient write bandwidth to continue
to support the demands of scientific applications, the HPC community
has developed a number of I/O middleware layers that structure output
into write-optimized file formats. However, the obvious concern with
any write-optimized file format is a corresponding penalty on reads.
In the log-structured filesystem [13], for example, a file generated
by random writes could be written efficiently, but reading the file
back sequentially later would result in very poor performance.
Simulation results require efficient read-back for visualization and
analytics, and though most checkpoint files are never used, the
efficiency of a restart is very important in the face of inevitable
failures. The utility of middleware that improves write speed would
be greatly diminished if it sacrificed acceptable read performance.
In this paper we examine the read performance of two write-optimized
middleware layers on large parallel machines and compare it to
reading data natively in popular file formats.

The two I/O middleware layers examined in this paper are the
Adaptable IO System (ADIOS) [12, 9], a library-based approach
developed at Oak Ridge National Laboratory to provide a high-level
IO API that can be used in place of netCDF or HDF5 to do much more
aggressive write-behind and efficient reordering of data locations
within the file; and the Parallel Log-structured Filesystem (PLFS)
[5], a stackable FUSE [1] filesystem approach developed at Los Alamos
National Laboratory that decouples concurrent writes to improve the
speed of checkpoints. Since ADIOS is an I/O componentization that
affords selection of different I/O methods at or during runtime
through a single API, users can have access to MPI-IO, POSIX-IO,
HDF5, netCDF, and staging methods [4]. The ADIOS BP file format [10]
is a new log-file format that has a superset of the features of both
HDF5 and netCDF, but is designed to be portable and flexible while
being optimized for writing. PLFS, on the other hand, is mounted as
a stackable filesystem on top of an existing parallel filesystem.
Reads or writes to the PLFS filesystem are transparently translated
into operations on per-process log files stored in the underlying
parallel filesystem. Since PLFS performs this translation without
application modification, users can write in HDF5, netCDF, or
app-specific file formats and PLFS will store the writes in a set of
efficiently written log-formatted files, while presenting the user
with a logical ‘flat’ file on reads. Despite their different
approaches, the commonality behind these middleware systems is that
they both write to a log file format. As shown in previous
publications [11, 5], writes are fully optimized in both systems,
sometimes resulting in 100x improvements over writing data in
popular file formats. But as mentioned above, writing is only one
part of the story. In this paper we examine the read performance of
our middleware layers on large parallel machines and compare these
to reading data either natively or from other popular file formats.
We compare the reading performance in two different scenarios:
1) reading back restarts from the same number of processors as wrote
the data and 2) reading back restart data from a different number of
processors.

We observe that not only can write-optimized I/O middleware be built
to not penalize read speeds, but for important workloads, techniques
that improve write performance can, perhaps counterintuitively,
improve read speeds over reading from a contiguously organized file
format. In the remainder of this paper, we investigate this further
through case studies of PLFS and ADIOS on simulation checkpoint
restarts.

2.  PLFS 

As described above, the Parallel Log-Structured Filesystem (PLFS) is
a FUSE-based stackable filesystem that accelerates the write
performance of N-1 checkpoints by decoupling the concurrent access
of multiple processes writing to a single, shared file into many
non-concurrent writes appending to per-process log files.
Figure  1:  Progress  of  sequential  reading  of  a  checkpoint  log  file


Filesystems perform well on sequential access to single-writer files
compared to strided, concurrent writes to a shared file. This is
true even of parallel filesystems; although they spread I/O across
multiple servers and spindles, parallel filesystems still must deal
with contention within a single file object or stripe. So it is
unsurprising that PLFS’s log-structured writing has achieved large
write performance improvements on real applications and checkpoint
benchmarks [5]. However, for pathological combinations of write and
read patterns (for example, a random write pattern read back
sequentially), a log file format is expected to exhibit poor read
performance due to additional seeks. Surprisingly, this expected
poor read performance was not borne out during PLFS experiments.

To understand why PLFS performs so well on its read path, it helps
to understand that while reading sequentially from a log file format
consisting of random writes can be slow, the checkpointing and
scientific applications that write to PLFS do so in a structured
manner. An analysis of the trace repository of checkpoints run
through PLFS available on the web [2] reveals two important
observations. First, the multiple processes writing a checkpoint do
interleave their writes with one another; this is the behavior that
allows middleware layers such as PLFS and ADIOS to improve write
bandwidth. The second observation is that all writes from each
individual process are strictly increasing in their logical offsets.

Figure 1 is a graphical representation of three time steps of a
single client reading sequentially from a PLFS checkpoint file
created by three writing processes and stored as log structured
files on an underlying parallel filesystem. Squares represent
segments of client memory or entries in a log file, and the numbers
inside correspond to their logical offset. Grey boxes represent the
parallel filesystem’s read-ahead buffers for each log. Note that
each log file is monotonically increasing in logical offset, and the
next request from the client is always at the front of one of the
log files. Due to this property, as the client continues reading
from the checkpoint file, rather than the log format resulting in
expensive seeks, the read-ahead buffers will slide forward through
the log files, reading them sequentially in parallel. ADIOS’s BP
format is somewhat different, but by storing variables together in
log formatted files, it too allows for efficient read back in an
analogous manner.
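The sliding behavior described for Figure 1 can be illustrated with a
small sketch. This is only a toy model, not PLFS's actual index or
read path: the function names, the unit block size, and the use of a
heap to pick the next log are all assumptions made for illustration.
It builds three per-writer logs from a strided write pattern and then
replays a sequential read, recording which log position serves each
logical offset.

```python
import heapq

def write_strided(num_writers, writes_per_writer, block):
    """Return one log per writer; each entry is (logical_offset, length).

    Writer `rank` issues its i-th write at logical offset
    (i * num_writers + rank) * block, so offsets within each log
    are strictly increasing.
    """
    return [
        [((i * num_writers + rank) * block, block)
         for i in range(writes_per_writer)]
        for rank in range(num_writers)
    ]

def sequential_read_order(logs):
    """Yield (logical_offset, writer, log_position) in logical order."""
    # The heap holds only the entry at the front of each log.
    heap = [(log[0][0], w, 0) for w, log in enumerate(logs) if log]
    heapq.heapify(heap)
    while heap:
        offset, w, pos = heapq.heappop(heap)
        yield offset, w, pos
        if pos + 1 < len(logs[w]):
            heapq.heappush(heap, (logs[w][pos + 1][0], w, pos + 1))

logs = write_strided(num_writers=3, writes_per_writer=4, block=1)
order = list(sequential_read_order(logs))
for offset, writer, pos in order:
    print(f"logical offset {offset}: log {writer}, entry {pos}")
```

In this model every log is consumed strictly front to back, which is
the access pattern that lets per-log read-ahead stay useful.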

Below we discuss how this property allows for efficient read-back of
PLFS checkpoint files on uniform and non-uniform restart, by
examining one checkpoint benchmark: the mpi_io_test benchmark from
LANL [7]. This benchmark is designed to represent a simple
checkpoint I/O workload. In the examples below, it is configured to
write a single 20 GB file in 47 KB strided writes from a varying
number of processes on the LANL Roadrunner system. One implication
of this workload is that as the number of writers grows, the amount
written by each writer (and the corresponding size of the PLFS log
file) shrinks. We write both to PLFS and to the underlying parallel
filesystem directly and compare the results.
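The shrinking per-writer log size can be made concrete with a short
calculation. The sketch below assumes binary units (20 GiB and
47 KiB) and an even split of writes across processes; the writer
counts in the loop are hypothetical and are not the counts used in
the experiments.

```python
FILE_BYTES = 20 * 2**30   # assumption: "20 GB" read as 20 GiB
WRITE_BYTES = 47 * 2**10  # assumption: "47 KB" read as 47 KiB

def per_writer(num_writers):
    """Return (writes per process, bytes per process log).

    Assumes the fixed-size file is split evenly, so each process
    issues the same number of WRITE_BYTES-sized strided writes.
    """
    total_writes = FILE_BYTES // WRITE_BYTES
    writes_each = total_writes // num_writers
    return writes_each, writes_each * WRITE_BYTES

for n in (8, 64, 512, 4096):  # hypothetical writer counts
    writes_each, bytes_each = per_writer(n)
    print(f"{n:5d} writers: {writes_each:7d} writes, "
          f"{bytes_each / 2**20:10.1f} MiB per log")
```

Because the file size is fixed, each doubling of the writer count
roughly halves the length of every log, which shortens the sequential
run available to read-ahead.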
Figure  2:  Uniform  Reads  for  PLFS


2.1  Uniform  Restart  in  PLFS 

Reading back a checkpoint file with the same number of processes is
highly efficient in PLFS. During a checkpoint restart, each reading
process will read back one writing process’s data from the
checkpoint in the same pattern in which it was written. In the case
of PLFS, since the checkpoint was written in per-process
monotonically increasing offsets, each reader will access a single
log structured file in sequential order. The scenario is similar to
that shown in Figure 1, except with multiple clients each reading
individual log files, allowing the underlying parallel filesystem to
make efficient use of read-ahead buffers and disks. Results are
shown in Figure 2. Performance quickly ramps up, but gradually falls
off as the number of writers for the 20 GB file increases: the size
of each individual log file shrinks, and the benefit of prefetching
is reduced.

By contrast, having all processes read directly from a single file
stored on the parallel filesystem elicits poor performance, as
multiple readers are at any given time all issuing a series of small
reads to a relatively narrow window that gradually moves through the
file. This leads to possible contention within the file, more seeks,
and poor utilization of spindles and servers (since a narrow region
of the file is generally not spread as widely across disks and
servers as multiple log structured files). As a result, the read
back speed of readers without PLFS does not scale noticeably with
more processes.
Figure  3:  Non-Uniform  Reads  for  PLFS


2.2  Non-Uniform  Restart  in  PLFS 

Reading back a checkpoint file with fewer processes is somewhat less
efficient in PLFS, as shown in Figure 3. Here a file is written by a
varying number of processes and read back by eight fewer reader
processes. The situation is similar to the previous section;
however, now each log structured file is accessed by multiple
processes instead of just one. The result is some additional read
contention within each PLFS log file.

Reading directly from a single, flat file stored on the underlying
parallel filesystem is analogous to the uniform case, however, and
once again we see it exhibit non-scaling read performance.

3.  ADIOS 

The ADIOS architecture is designed for portable, scalable
performance in two key ways. First, the componentization of the IO
implementation affords selecting the highest performance output
mechanism for a particular platform without requiring any source
code changes. Second, the default BP file format achieves excellent
write performance by allowing consistency checks to be delayed [10].
This enables aggressive buffering and decouples processes during the
output process. The output format is organized around the concept of
a Process Group, a collection of data typically written by a single
process, similar in spirit to PLFS’s per-writer logs. Each process
is assigned a section of the file to write local data with
sufficient annotations to identify the items written and place the
portions of global items properly in the global space. For example,
global arrays are written as a local array in the assigned process
group with annotation of the global array it is part of, the global
dimensions, the local dimensions, and the offsets into the global
space this piece represents. Scalars, other local values, and data
attributes are stored similarly. This format decouples processes and
avoids data reorganization: writing does not construct a contiguous
global array on disk, and reading does not have to select a slice
appropriate for the local process, assuming the same number of
processes and the same decomposition are used when reading. The
perceived potential penalty for this gain in write performance is
the inability to read the data back in efficiently. For restarts of
a p process output read back in on p processes, this is an optimal
format. Each process reads the data from exactly one process group
independently and requires no data reorganization. For other process
configurations, the performance story was unknown. This evaluation
resolves the restart performance question.
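The per-block annotations described above (global dimensions, local
dimensions, and offsets) are enough for a reader to work out which
stored blocks it needs. The following sketch illustrates that idea
only; it is not the BP on-disk layout or the ADIOS API, and the names
BlockMeta and overlap, as well as the 8x8 example array, are
hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (start, count)

@dataclass(frozen=True)
class BlockMeta:
    """Annotation kept with one writer's local piece of a global array."""
    writer: int
    global_dims: Tuple[int, ...]
    local_dims: Tuple[int, ...]
    offsets: Tuple[int, ...]  # position of the piece in the global space

def overlap(block: BlockMeta, start, count) -> Optional[Box]:
    """Intersect a requested box with a block, in global coordinates."""
    lo, cnt = [], []
    for off, ldim, s, c in zip(block.offsets, block.local_dims, start, count):
        a = max(off, s)
        b = min(off + ldim, s + c)
        if b <= a:
            return None  # disjoint along this axis
        lo.append(a)
        cnt.append(b - a)
    return tuple(lo), tuple(cnt)

# Hypothetical 8x8 global array written by four processes as 4x4 pieces.
blocks = [
    BlockMeta(w, (8, 8), (4, 4), (4 * (w // 2), 4 * (w % 2)))
    for w in range(4)
]

# Uniform restart: the reader asks for exactly what writer 2 wrote.
print(overlap(blocks[2], (4, 0), (4, 4)))

# Non-uniform restart: one reader wants the top half of the array.
needed = [b.writer for b in blocks if overlap(b, (0, 0), (4, 8))]
print(needed)
```

In the uniform case the request matches a single stored block, so no
reorganization is needed; in the non-uniform case the reader gathers
pieces from several process groups.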

3.1  Evaluation 

Four different experiments are performed on one of the most
difficult data decompositions, a 3-D domain decomposition. For this
application, the simulation domain is a 3-D space divided into boxes
along the X, Y, and Z dimensions with each process responsible for
one rectangular area. For netCDF and HDF5 formatted data, this
output is reorganized into a contiguous format for the entire 3-D
domain with selective reading on restarts. The BP format stores each
rectangular area as a single block in a process group with each area
in a non-contiguous space in the file. In order to read the entire
3-D domain from a single process, in order, the system would need to
skip around the file or read in chunks, reorganizing in memory. An
IO kernel for the Pixie3D MHD code [6] is tested.

Pixie3D has three modes of operation tested for this study. The
Small configuration consists of 256 KB of data per process, Medium
has 16 MB per process, and Large has 128 MB per process. The data is
divided evenly among 8 variables and also contains many small scalar
values. To test the restart read performance, various process counts
from 128 to 2048 are employed to write the data, and either the same
number or one-half the number of writing processes is used to read
the data back in. The results are compared against netCDF formatted
data.

For all tests, the best result from a series of four runs is
selected for each data point for both BP and netCDF. The horizontal
axis represents the number of writers employed to generate the
restart data, regardless of the number of processes used to read the
data back in.

The  experiments  are  performed  on  Jaguar,  the  Cray  XT4 
at  ORNL,  striping  across  all  144  Lustre  storage  targets  and 
using  a  stripe  size  of  1  MB. 
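To give a feel for what this striping configuration implies, the
sketch below uses a simplified round-robin model in which consecutive
1 MB stripes are placed on consecutive storage targets. This is an
assumption for illustration only: a real layout may start at an
arbitrary target and is managed by the filesystem, and the function
name targets_touched is hypothetical.

```python
STRIPE_SIZE = 1 * 2**20  # 1 MB stripe size, as in the experiments
STRIPE_COUNT = 144       # storage targets striped across

def targets_touched(offset, length,
                    stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Return the set of target indices a byte range maps to.

    Simplified model: stripe k of the file lives on target
    k % stripe_count.
    """
    if length <= 0:
        return set()
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    if last - first + 1 >= stripe_count:
        return set(range(stripe_count))
    return {s % stripe_count for s in range(first, last + 1)}

# A small read touches one or two targets; only a range spanning at
# least stripe_size * stripe_count bytes reaches every target.
print(len(targets_touched(0, 2**20)))
print(len(targets_touched(2**19, 2**20)))
print(len(targets_touched(0, 144 * 2**20)))
```

Under this model, many readers working inside a narrow window of one
file share a handful of targets, while large contiguous reads spread
across more of them.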

3.2  Pixie3D  Uniform  Restarts 

To establish a baseline, a uniform restart is tested, that is, one
using the same number of processes to read the restart as wrote it
originally. Figure 4 shows the performance results. For large data,
the performance approaches the IOR benchmark performance [15] for
the machine.

3.3  Pixie3D  Small 

For the small data, good read performance cannot be attained no
matter the number of processes used to read or write the data. The
overall data size is too small to overcome the inherent overhead of
the parallel file system. Figure 5 shows the performance for reading
the restart output on half as many processes as wrote the restart.
The horizontal axis represents the number of processes that wrote
the data originally.

Figure 4: Uniform Reads for the Pixie3D data

Figure 5: Small Model, Half Process Count

The BP formatted data was read faster than the netCDF formatted
data. The performance trendlines indicate that the gap should
continue to widen as the process count increases.

3.4  Pixie3D  Medium 

Restarts  using  the  medium  data  model  can  achieve  much 
better  performance,  but  still  cannot  achieve  more  than  a 
fraction  of  the  theoretical  maximum  performance  for  the 
system.  Figure  6  shows  the  performance  for  restarting  using 
half  as  many  processes. 

For the ‘half’ case, BP consistently outperforms netCDF, with the
performance gap narrowing slightly as the process count increases.

3.5  Pixie3D  Large 

The large data cases finally reach the maximum general performance
seen for applications in production use. Figure 7 shows the
performance for using half as many processes to restart.


Figure  6:  Medium  Model,  Half  Process  Count 
Figure  7:  Large  Model,  Half  Process  Count

3.6  Discussion 

Overall, the performance for all configurations of BP data is either
better than or about the same as that of a contiguous format like
netCDF. Comparing the half-process restarts with the uniform
restarts, the performance for large and medium data is about 80% of
the uniform read rate. Small data shows about the same performance.
Though we do not have room to show it here, we also tested ADIOS
read performance on additional domain decompositions, reading back
with different numbers of readers (including restarts involving more
reading processes than writers). In most cases ADIOS was 2x faster
compared to reading directly from netCDF, and read speed was always
superior on non-uniform restarts.

An additional set of tests using a non-integer factor difference
between the writers and readers yields similar results. For example,
using 64 writers and 80 readers represents moving from a 4x4x4 setup
to 4x4x5. For small data, BP is consistently 2x faster than netCDF;
for medium data, it is 20% faster; and for large data, the
performance is essentially identical. The detailed results were
omitted for space reasons.
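The 4x4x4 to 4x4x5 example can be explored with a short sketch that
counts how many writer blocks each reader's box intersects. The
20x20x20 cell grid is a hypothetical size chosen only because it
divides evenly by both 4 and 5; the helper names are likewise
assumptions and do not come from the evaluated code.

```python
from itertools import product

def slabs(n_cells, parts):
    """Split [0, n_cells) into `parts` equal half-open intervals."""
    assert n_cells % parts == 0
    width = n_cells // parts
    return [(i * width, (i + 1) * width) for i in range(parts)]

def overlaps_1d(readers, writers):
    """For each reader interval, count the writer intervals it touches."""
    return [
        sum(1 for w_lo, w_hi in writers if max(r_lo, w_lo) < min(r_hi, w_hi))
        for r_lo, r_hi in readers
    ]

def blocks_per_reader(cells, writer_grid, reader_grid):
    """Number of writer blocks intersected by each reader's box."""
    per_axis = [
        overlaps_1d(slabs(c, r), slabs(c, w))
        for c, w, r in zip(cells, writer_grid, reader_grid)
    ]
    # Boxes intersect only if they intersect along every axis, so the
    # block count is the product of the per-axis counts.
    return [x * y * z for x, y, z in product(*per_axis)]

counts = blocks_per_reader((20, 20, 20), (4, 4, 4), (4, 4, 5))
print(len(counts), sorted(set(counts)))
```

In this toy setup each of the 80 readers needs data from only one or
two of the 64 writer blocks, since the decompositions differ along a
single axis.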

4.  CONCLUSION  AND  FUTURE  WORK 

Previously, both ADIOS’s BP format and PLFS’s log-based format have
shown excellent performance for writing data due to reorganizing the
data when writing. This generated two main open questions. First,
what is the performance for reading all of the data, such as for a
restart? Second, for reading a subset of the data, such as for a
typical analysis task, how much does the log-based format penalize
the user? As has been shown, neither of these log-based formats
suffers the expected penalty when reading restarts. This holds not
just for reading on the same number of processes as wrote the data,
but also when fewer processes read the data back in.

When reading on the same number of processes as generated the data,
no reorganization effort is required for any process to read the
desired data and the data is stored in larger, single chunks on the
various storage targets. Each process can read large, contiguous
blocks of data from the file, reducing the likelihood of interfering
with other processes reading other data. Restarting on a different
number of processes is more challenging and requires a deeper
analysis.

The belief that a canonical storage format is most efficient for
reading is based on the notion that most often, contiguous chunks of
data will be read. The closer that data is stored together, the
better the overall performance. While this may have been true on a
single spindle with a single reader, with parallel processes and
file systems, a different consideration must be made. Parallel file
systems rely on large, contiguous reads or writes to all storage
targets in parallel for optimal performance. Although neither BP nor
the PLFS format is necessarily directly knowledgeable about the
underlying storage organization, the formats are designed
understanding that the file system employed will use the strategy of
striping the data across many storage targets while performing
fewer, larger operations to attain high performance on these
systems.

The second question of the performance for data analysis tasks is
still open. These results demonstrate that a log-based format has
viable performance for restarts of various data sizes and process
counts, but not that it is necessarily superior for read workloads
overall. Particularly missing is evidence that the log-based format
works well for analysis patterns. For example, though it still
increases write speeds dramatically, PLFS lacks domain knowledge of
the variables being written and is slightly penalized when reading
back Pixie3D data, unlike ADIOS. Both systems may need to be
enhanced to handle analytic read patterns that differ drastically
from write and restart patterns. Work is ongoing to evaluate how the
restart performance relates to the analysis task performance.